Abstract

In this paper we propose a framework for predicting kernelized classifiers
in the visual domain for categories with no training images, where the
knowledge comes from textual descriptions of these categories. Through our
optimization framework, the proposed approach is capable of embedding the
class-level knowledge from the text domain as kernel classifiers in the
visual domain. We also propose a distributional semantic kernel between
text descriptions, which is shown to be effective in our setting. The
proposed framework is not restricted to textual descriptions, and can also
be applied to other forms of knowledge representation. Our approach was
applied to the challenging task of zero-shot learning of fine-grained
categories from text descriptions of these categories.

1 Introduction 


We propose a framework to model kernelized classifier prediction in the
visual domain for categories with no training images, where the knowledge
about these categories comes from a secondary domain. The side information
can be in the form of text, parse trees, grammar, visual representations,
concepts in ontologies, or any other form; see Fig. 1. Our work focuses on
the unstructured text setting. We denote the side information as
“privileged” information, borrowing the notion from (Vapnik and Vashist,
2009).

Our framework is an instance of the concept of Zero Shot Learning (ZSL)
(Larochelle et al., 2008), aiming at transferring knowledge from seen
classes to novel (unseen) classes. Most zero-shot learning applications in
practice use symbolic or numeric visual attribute vectors (Lampert et al.,
2014; Lampert et al., 2009). In contrast, recent works investigated other
forms of descriptions, e.g. user provided feedback (Wah and Belongie,
2013) and textual descriptions (Elhoseiny et al., 2013). It is common in
zero-shot learning to introduce an intermediate layer that facilitates
knowledge sharing between seen classes, hence the transfer of knowledge to
unseen classes. Typically, visual attributes are used for that purpose,
since they provide a human-understandable representation, which enables
specifying new categories (Lampert et al., 2014; Lampert et al., 2009;
Farhadi et al., 2009; Palatucci et al., 2009; Akata et al., 2013; Li et
al., 2014). A fundamental question in attribute-based ZSL models is how to
define attributes that are visually discriminative and human
understandable. Researchers have explored learning attributes from text
sources, e.g. (Rohrbach, 2014; Rohrbach et al., 2013; Rohrbach et al.,
2010; Berg et al., 2010). Other works have explored interactive
methodologies for learning visual attributes that are human
understandable, e.g. (Parikh and Grauman, 2011).

Figure 1: Our setting, where a machine can predict an unseen class from
pure unstructured text


There are several differences between our proposed framework and the
state-of-the-art zero-shot learning approaches. We are not restricted to
using attributes as the interface to specify new classes. We can use any
“privileged” information available for each category. In particular, in
this paper we focus on the case of textual descriptions of categories as
the secondary domain. This difference is reflected in our zero-shot
classification architecture. We learn a domain transfer model between the
visual domain and the privileged information domain. This facilitates
predicting explicit visual classifiers for novel unseen categories given
their privileged information. The difference in architecture becomes clear
if we consider, for the sake of argument, attributes as the secondary
domain in our framework, although this is not the focus of the paper. In
that case we do not need explicit attribute classifiers to be learned as an
intermediate layer, as typically done in attribute-based ZSL, e.g.
(Lampert et al., 2009; Farhadi et al., 2009; Palatucci et al., 2009);
instead the visual classifiers are directly learned from the attribute
labels. The need to learn an intermediate attribute classifier layer in
most attribute-based zero-shot learning approaches dictates using strongly
annotated data, where each image comes with attribute annotation, e.g. the
CU-Birds dataset (Welinder et al., 2010). In contrast, we do not need
image-annotation pairs, and privileged information is only assumed at the
category level; hence we denote our approach weakly supervised. This also
directly facilitates using continuous attributes in our case, and does not
assume independence between attributes.


Another fundamental difference in our case is that we predict an explicit
kernel classifier, in the form defined in the representer theorem
(Schölkopf et al., 2001), from privileged information. Explicit classifier
prediction means that the output of our framework is classifier parameters
for any new category given its text description, which can be applied to
any test image to predict its class. Predicting the classifier in
kernelized form opens the door for using any kind of side information
about classes, as long as kernels can be defined on them. The image
features also do not need to be in a vectorized format. Kernelized
classifiers also facilitate combining different types of features through
a multi-kernel learning (MKL) paradigm, where the fusion of different
features can be effectively achieved.

We can summarize the features of our proposed framework, hence the
contribution, as follows: 1) Our framework explicitly predicts classifiers;
2) The predicted classifiers are kernelized; 3) The framework facilitates
any type of “side” information to be used; 4) The approach requires the
side information at the class level, not at the image level, hence it needs
only weak annotation; 5) We propose a distributional semantic kernel
between text descriptions of visual classes, whose value we show in the
experiments. The structure of the paper is as follows. Sec. 2 describes the
relation to existing literature. Sec. 3 and Sec. 4 explain the learning
setting and our formulation. Sec. 5 presents the proposed distributional
semantic kernel for text descriptions. Sec. 6 shows our experimental
results.

2 Related Work 


We already discussed the relation to the zero-shot learning literature in
the Introduction section. In this section, we focus on the relations to
other volumes of literature.

There has been increasing interest recently in the intersection between
Language and Computer Vision. Most of the work in this area is focused on
generating textual descriptions from images (Farhadi et al., 2010;
Kulkarni et al., 2011; Ordonez et al., 2011; Yang et al., 2011; Mitchell
et al., 2012). In contrast, we focus on generating visual classifiers from
textual descriptions or other side information at the category level.

There are few recent works that involved unannotated text to improve
visual classification or achieve zero-shot learning. In (Frome et al.,
2013; Norouzi et al., 2014) and (Socher et al., 2013), word embedding
language models (e.g. (Mikolov et al., 2013b)) were adopted to represent
class names as vectors. Their framework is based on mapping images into the
learned language model and then performing classification in that space.
In contrast, our framework maps the text information to a classifier in the
visual domain, i.e. the opposite direction of their approach. There are
several advantages in mapping textual knowledge into the visual domain. To
perform ZSL, approaches such as (Norouzi et al., 2014; Frome et al., 2013;
Socher et al., 2013) only embed new classes by their category names. This
has clear limitations when dealing with fine-grained categories (such as
different bird species). Most fine-grained category names do not exist in
current semantic models. Even if they exist, they will end up close to each
other in the learned language models, since they typically share similar
contexts. This limits the discriminative power of such language models. In
fact, our baseline experiment using these models performed as low as random
when applied to fine-grained categories; see Sec. 6.4. Moreover, our
framework can directly use large text descriptions of novel categories. In
contrast to (Norouzi et al., 2014; Frome et al., 2013; Socher et al.,
2013), which required a vectorized representation of images, our framework
facilitates non-linear classification using kernels.


In (Elhoseiny et al., 2013), an approach was proposed to predict linear
classifiers from textual descriptions, based on a domain transfer
optimization method proposed in (Kulis et al., 2011). Although both of
these works are kernelized, a close look reveals that kernelization was
mainly used to reduce the size of the domain transfer matrix and the
computational cost. The resulting predicted classifier in (Elhoseiny et
al., 2013) is still a linear classifier. In contrast, our proposed
formulation predicts kernelized visual classifiers directly from the domain
transfer optimization, which is a more general case. This directly
facilitates using classifiers that fuse multiple visual cues, such as
Multiple Kernel Learning (MKL).


3 Problem Definition 

We consider a zero-shot multi-class classification setting on domain
$\mathcal{X}$ as follows. At training, besides the data points from
$\mathcal{X}$ and the class labels, each class is associated with
privileged information in a secondary domain $\mathcal{E}$; in particular,
but not limited to, a textual description. We assume that each class
$y_i \in \mathcal{Y}_{sc}$ (training/seen labels) is associated with
privileged information $e_i \in \mathcal{E}$. While our formulation allows
multiple pieces of privileged information per class (e.g. multiple
class-level textual descriptions), we will use one per class for
simplicity. Hence, we denote the training data as
$\mathcal{D}_{train} = \{S_x = \{(x_i, y_i)\}_{N}, S_e = \{(y_j, e_j)\}_{N_{sc}}\}$,
where $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}_{sc}$,
$y_j \in \mathcal{Y}_{sc}$, and $N_{sc}$ and $N$ are the number of seen
classes and the number of training examples/images respectively. We assume
that each of the domains is equipped with a kernel function corresponding
to a reproducing kernel Hilbert space (RKHS). Let us denote the kernel for
$\mathcal{X}$ by $k(\cdot, \cdot)$ and the kernel for $\mathcal{E}$ by
$g(\cdot, \cdot)$. At zero-shot time, only the privileged information
$e_{z^*}$ is available for each novel unseen class $z^*$; see Fig. 1.

The common approach for multi-class classification is to learn a classifier
for each class against the remaining classes (i.e., one-vs-all). According
to the generalized representer theorem (Schölkopf et al., 2001), a
minimizer of a regularized empirical risk function over an RKHS can be
represented as a linear combination of kernels, evaluated on the training
set. Adopting the representer theorem on the classification risk function,
we define a kernel-classifier of class $y$ as follows

\[
f_y(x^*) = \sum_{i=1}^{N} \beta_y^i \, k(x^*, x_i) + b = \beta_y^\top \mathbf{k}(x^*), \tag{1}
\]

where $x^* \in \mathcal{X}$ is the test point, $x_i \in S_x$,
$\mathbf{k}(x^*) = [k(x^*, x_1), \cdots, k(x^*, x_N), 1]^\top$, and
$\beta_y = [\beta_y^1 \cdots \beta_y^N, b]^\top$. Having learned $f_y(x^*)$
for each class $y$ (for example using an SVM classifier), the class label of
the test point $x^*$ can be predicted as

\[
y^* = \arg\max_{y} f_y(x^*) \tag{2}
\]
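As an illustration of Eq. 1 and Eq. 2, the following is a minimal sketch of scoring a test point with kernel classifiers and picking the arg-max class. The helper names (`kernel_vector`, `predict_label`), the toy data and the choice of kernel are assumptions made for this example, not part of the original formulation.

```python
import numpy as np

def kernel_vector(k, x_star, X_train):
    """k(x*) = [k(x*, x_1), ..., k(x*, x_N), 1]^T; the trailing 1 multiplies the bias b."""
    return np.array([k(x_star, x_i) for x_i in X_train] + [1.0])

def predict_label(k, x_star, X_train, betas):
    """Eq. 1 and Eq. 2: score each class with beta_y^T k(x*), then take the arg-max.

    betas has shape (num_classes, N + 1); row y is [beta_y^1, ..., beta_y^N, b].
    """
    scores = betas @ kernel_vector(k, x_star, X_train)
    return int(np.argmax(scores)), scores

# Toy data (hypothetical): two training points and an exponential kernel.
rbf = lambda a, b: float(np.exp(-np.linalg.norm(a - b)))
X_train = np.array([[0.0, 0.0], [3.0, 3.0]])
betas = np.array([[1.0, -1.0, 0.0],    # class 0 is anchored on the first point
                  [-1.0, 1.0, 0.0]])   # class 1 is anchored on the second point
label, scores = predict_label(rbf, np.array([0.2, -0.1]), X_train, betas)
```

Only kernel evaluations against the training points are needed at test time, so the same code works for any kernel function passed in as `k`.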

It is clear that $f_y(x^*)$ can be learned for all classes with training
data, $y \in \mathcal{Y}_{sc} = y_1 \cdots y_{N_{sc}}$, since there are
examples $S_x$ for the seen classes; we denote the kernel-classifier
parameters of the seen classes as
$B_{sc} = \{\beta_y\}_{N_{sc}}, \forall y \in \mathcal{Y}_{sc}$. However,
it is not obvious how to predict $f_{z^*}(x^*)$ for a new unseen class
$z^* \in \mathcal{Y}_{us}$. Our main notion is to use the privileged
information $e_{z^*} \in \mathcal{E}$, associated with the unseen class
$z^*$, and the training data $\mathcal{D}_{train}$ to directly predict the
unseen kernel-classifier parameters. Hence, the classifier of $z^*$ is a
function of $e_{z^*}$ and $\mathcal{D}_{train}$; i.e.

\[
f_{z^*}(x^*) = \beta(e_{z^*}, \mathcal{D}_{train})^\top \cdot \mathbf{k}(x^*), \tag{3}
\]

$f_{z^*}(x^*)$ can be used to classify new points that belong to an unseen
class as follows: 1) in a one-vs-all setting, $f_{z^*}(x^*) \gtrless 0$; or
2) in a multi-class prediction as in Eq. 2.

4 Approach 

Prediction of $\beta(e_{z^*}, \mathcal{D}_{train})$, which we denote as
$\beta(e_{z^*})$ for simplicity, is decomposed into training (domain
transfer) and prediction phases.

4.1 Domain Transfer 

During training, we first learn $B_{sc}$ as SVM-kernel classifiers based on
$S_x$. Then, we learn a domain transfer function to transfer the privileged
information $e \in \mathcal{E}$ to kernel-classifier parameters
$\beta \in \mathbb{R}^{N+1}$ in the $\mathcal{X}$ domain. We call this
function $\tilde{\beta}_{DT}(e)$, which has the form $T^\top g(e)$, where
$g(e) = [g(e, e_1) \cdots g(e, e_{N_{sc}})]^\top$; $T$ is an
$N_{sc} \times (N + 1)$ matrix, which transforms $e$ to kernel classifier
parameters for the class that $e$ represents.

We aim to learn $T$ such that $g(e)^\top T \mathbf{k}(x) > l$ if $e$ and
$x$ correspond to the same class, and $g(e)^\top T \mathbf{k}(x) < u$
otherwise. Here $l$ controls the similarity lower-bound if $e$ and $x$
correspond to the same class, and $u$ controls the similarity upper-bound
if $e$ and $x$ belong to different classes. In our setting, the term
$T^\top g(e_i)$ should act as a classifier parameter for class $i$ of the
training data. Therefore, we introduce penalization constraints to our
minimization function if $T^\top g(e_i)$ is distant from
$\beta_i \in B_{sc}$, where $e_i$ corresponds to the class that $\beta_i$
classifies. Inspired by domain adaptation optimization methods (e.g.
(Kulis et al., 2011)), we model our domain transfer function as follows

\[
T^* = \arg\min_{T} L(T) = \Big[ \tfrac{1}{2} r(T) + \lambda_1 \sum_{k} c_k(G T K)
      + \lambda_2 \sum_{i=1}^{N_{sc}} \| \beta_i - T^\top g(e_i) \|^2 \Big] \tag{4}
\]

where $G$ is an $N_{sc} \times N_{sc}$ symmetric matrix, such that both the
$i$-th row and the $i$-th column are equal to $g(e_i)$, $e_i \in S_e$; $K$
is an $(N + 1) \times N$ matrix, such that the $i$-th column is equal to
$\mathbf{k}(x_i)$, $x_i \in S_x$. The $c_k$'s are loss functions over the
constraints, defined as
$c_k(G T K) = (\max(0, (l - 1_i^\top G T K 1_j)))^2$ for same-class pairs
of index $i$ and $j$, or
$= r \cdot (\max(0, (1_i^\top G T K 1_j - u)))^2$ otherwise, where $1_i$ is
an $N_{sc} \times 1$ vector with all zeros except at index $i$, and $1_j$
is an $N \times 1$ vector with all zeros except at index $j$. This leads to
$c_k(G T K) = (\max(0, (l - g(e_i)^\top T \mathbf{k}(x_j))))^2$ for
same-class pairs of index $i$ and $j$, or
$= r \cdot (\max(0, (g(e_i)^\top T \mathbf{k}(x_j) - u)))^2$ otherwise,
where $u > l$ and $r = n_d / n_s$, such that $n_d$ and $n_s$ are the number
of pairs $(i, j)$ of different classes and of similar pairs respectively.
Finally, we used a Frobenius norm regularizer for $r(T)$.

The objective function in Eq. 4 controls the involvement of the constraints
$c_k$ by the term multiplied by $\lambda_1$, which controls its importance;
we call it $C_{l,u}(T)$. The trained classifiers penalty is captured by the
term multiplied by $\lambda_2$; we call it $C_\beta(T)$. One important
observation on $C_\beta(T)$ is that it reaches zero when
$T = G^{-1} B^\top$, where $B = [\beta_1 \cdots \beta_{N_{sc}}]$, since it
can be rewritten as $C_\beta(T) = \| B^\top - G T \|_F^2$.

One approach to minimize $L(T)$ is gradient-based optimization using a
quasi-Newton optimizer. Our gradient derivation of $L(T)$ leads to the
following form

\[
\frac{\delta L(T)}{\delta T} = T + \lambda_1 \cdot \sum_{i,j} g(e_i) \mathbf{k}(x_j)^\top v_{ij}
      + r \cdot \lambda_2 \cdot (G^2 T - G B) \tag{5}
\]

where $v_{ij} = -2 \cdot \max(0, (l - g(e_i)^\top T \mathbf{k}(x_j)))$ if
$i$ and $j$ correspond to the same class, and
$2 \cdot \max(0, (g(e_i)^\top T \mathbf{k}(x_j) - u))$ otherwise. Another
approach to minimize $L(T)$ is through alternating projection using the
Bregman algorithm (Censor and Zenios, 1997), in which $T$ is updated with
respect to a single constraint every iteration.
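To make the structure of the domain transfer objective concrete, the following is a minimal sketch that evaluates $L(T)$ on random toy data and checks the observation that the classifier penalty vanishes at $T = G^{-1} B^\top$. All names, the toy dimensions, the kernel choice and the hyper-parameter values are assumptions for illustration; the sketch evaluates the objective only and is not the optimizer used in the paper.

```python
import numpy as np

def c_beta(T, G, B):
    """Classifier penalty C_beta(T) = ||B^T - G T||_F^2."""
    return float(np.sum((B.T - G @ T) ** 2))

def c_lu(T, G, K, same_class, l, u, r):
    """Constraint term: squared hinge losses over all (class i, image j) pairs.

    S[i, j] = g(e_i)^T T k(x_j); same_class[i, j] is True when image j belongs to class i.
    """
    S = G @ T @ K
    same = np.maximum(0.0, l - S) ** 2        # same-class pairs should score above l
    diff = r * np.maximum(0.0, S - u) ** 2    # different-class pairs should score below u
    return float(np.sum(np.where(same_class, same, diff)))

def objective(T, G, K, B, same_class, l, u, r, lam1, lam2):
    """L(T) with the Frobenius regulariser r(T) = ||T||_F^2."""
    return (0.5 * np.sum(T ** 2)
            + lam1 * c_lu(T, G, K, same_class, l, u, r)
            + lam2 * c_beta(T, G, B))

# Toy problem (hypothetical sizes): 3 seen classes, 6 training images.
rng = np.random.default_rng(0)
N_sc, N = 3, 6
E = rng.normal(size=(N_sc, 4))                    # class-level text features
X = rng.normal(size=(N, 5))                       # image features
labels = np.array([0, 0, 1, 1, 2, 2])
gram = lambda A, C: np.exp(-np.linalg.norm(A[:, None, :] - C[None, :, :], axis=-1))
G = gram(E, E)                                    # N_sc x N_sc, column i is g(e_i)
K = np.vstack([gram(X, X), np.ones((1, N))])      # (N + 1) x N, column i is k(x_i)
B = rng.normal(size=(N + 1, N_sc))                # stand-in for learned SVM parameters
same_class = labels[None, :] == np.arange(N_sc)[:, None]
l, u, lam1, lam2 = 0.5, 1.0, 1.0, 1.0
r = (same_class.size - same_class.sum()) / same_class.sum()   # n_d / n_s

T_closed = np.linalg.solve(G, B.T)                # T = G^{-1} B^T
value_at_zero = objective(np.zeros((N_sc, N + 1)), G, K, B, same_class, l, u, r, lam1, lam2)
```

Since `objective` is a plain function of `T`, it could be handed to a generic quasi-Newton routine after flattening `T`; that step is omitted here.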


4.2 Classifier Prediction 

We propose two ways to predict the kernel-classifier: (1) Domain Transfer
(DT) Prediction, and (2) One-class-SVM adjusted DT Prediction.

Domain Transfer (DT) Prediction: The classifier of an unseen category is
directly computed from our domain transfer model as follows

\[
\tilde{\beta}_{DT}(e_{z^*}) = T^{*\top} g(e_{z^*}) \tag{6}
\]
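The DT prediction amounts to one kernel evaluation against each seen class description followed by a matrix-vector product. Below is a minimal sketch with hand-picked toy numbers; the function names, the linear kernels and the values of `T_star` are hypothetical and only serve to make the arithmetic easy to follow.

```python
import numpy as np

def predict_unseen_classifier(T_star, g, e_unseen, E_seen):
    """Eq. 6: beta_DT(e_z*) = T*^T g(e_z*), with g(e) = [g(e, e_1), ..., g(e, e_Nsc)]^T."""
    g_vec = np.array([g(e_unseen, e_i) for e_i in E_seen])
    return T_star.T @ g_vec            # shape (N + 1,): N kernel weights plus the bias

def score(beta, k, x_star, X_train):
    """Eq. 3: f_z*(x*) = beta^T k(x*), where k(x*) ends with a constant 1 for the bias."""
    k_vec = np.array([k(x_star, x_i) for x_i in X_train] + [1.0])
    return float(beta @ k_vec)

linear = lambda a, b: float(np.dot(a, b))          # toy kernel for both domains
E_seen = np.array([[1.0, 0.0], [0.0, 1.0]])        # two seen class descriptions
X_train = np.array([[1.0, 0.0], [0.0, 1.0]])       # two training images
T_star = np.array([[1.0, 0.0, 0.0],
                   [0.0, 2.0, 1.0]])               # N_sc x (N + 1), assumed already learned

beta_unseen = predict_unseen_classifier(T_star, linear, np.array([0.5, 0.5]), E_seen)
f_value = score(beta_unseen, linear, np.array([1.0, 1.0]), X_train)
```

Here $g(e_{z^*}) = [0.5, 0.5]^\top$, so the predicted parameters are $[0.5, 1.0, 0.5]^\top$ and the score of the test point $[1, 1]^\top$ is $2.0$.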


One-class-SVM adjusted DT (SVM-DT) Prediction: In order to increase
separability against seen classes, we adopted the inverse of the idea of
the one-class kernel SVM, whose main idea is to build a confidence function
that takes only positive examples of the class. Our setting is the opposite
scenario; seen examples are negative examples of the unseen class. In order
to introduce our proposed adjustment method, we start by presenting the
one-class SVM objective function. The Lagrangian dual of the one-class SVM
(Evangelista et al., 2007) can be written as

\[
\beta_+^* = \arg\min_{\beta} \big[ \beta^\top K' \beta - \beta^\top a \big]
\quad s.t.: \ \beta^\top 1 = 1, \ 0 \le \beta_i \le C; \ i = 1 \cdots N \tag{7}
\]

where $K'$ is an $N \times N$ matrix, $K'(i, j) = k(x_i, x_j)$,
$\forall x_i, x_j \in S_x$ (i.e. in the training data), $a$ is an
$N \times 1$ vector with $a_i = k(x_i, x_i)$, and $C$ is a hyper-parameter.
It is straightforward to see that, if $\beta$ is aimed to be a negative
decision function instead, the objective function takes the form

\[
\beta_-^* = \arg\min_{\beta} \big[ \beta^\top K' \beta + \beta^\top a \big]
\quad s.t.: \ \beta^\top 1 = -1, \ -C \le \beta_i \le 0; \ i = 1 \cdots N \tag{8}
\]

While $\beta_-^* = -\beta_+^*$, the objective function in Eq. 8 of the
one-negative-class SVM inspires us with the idea to adjust the
kernel-classifier parameters to increase separability of the unseen
kernel-classifier against the points of the seen classes, which leads to
the following objective function

\[
\hat{\beta}(e_{z^*}) = \arg\min_{\beta} \big[ \beta^\top K' \beta
      - \zeta \hat{\beta}_{DT}(e_{z^*})^\top K' \beta + \beta^\top a \big]
\]
\[
s.t.: \ \beta^\top 1 = -1, \ \hat{\beta}_{DT}(e_{z^*})^\top K' \beta > l, \ -C \le \beta_i \le 0; \ \forall i
\qquad C, \zeta, l: \text{hyper-parameters}, \tag{9}
\]

where $\hat{\beta}_{DT}$ is the first $N$ elements of
$\tilde{\beta}_{DT}(e_{z^*})$, and $1$ is an $N \times 1$ vector of ones.
The objective function in Eq. 9 pushes the classifier of the unseen class
to be highly correlated with the domain transfer prediction of the kernel
classifier, while putting the points of the seen classes as negative
examples. It is not hard to see that Eq. 9 is a quadratic program in
$\beta$, which can be solved using any quadratic solver; we used IBM CPLEX.
It is worth mentioning that the approach in (Elhoseiny et al., 2013)
predicts linear classifiers by solving an optimization problem of size
$N + d_X + 1$ variables ($d_X + 1$ linear-classifier parameters and $N$
slack variables); a similar limitation can be found in (Frome et al., 2013;
Socher et al., 2013). In contrast, our objective function in Eq. 9 solves a
quadratic program of only $N$ variables, and predicts a kernel-classifier
instead, with fewer parameters. Hence, if very high-dimensional features
are used, they will not affect our optimization complexity.
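The relation between the one-class objective (Eq. 7) and its negative counterpart (Eq. 8) can be checked numerically: substituting $\beta \to -\beta$ maps the feasible set of one problem onto the other and leaves the objective value unchanged, which is why the two minimizers are negatives of each other. The sketch below only verifies this change of variables on random data; it does not solve either quadratic program, and all names, sizes and the kernel choice are assumptions.

```python
import numpy as np

def objective_pos(beta, K, a):
    """Eq. 7 objective: beta^T K' beta - beta^T a."""
    return float(beta @ K @ beta - beta @ a)

def objective_neg(beta, K, a):
    """Eq. 8 objective: beta^T K' beta + beta^T a."""
    return float(beta @ K @ beta + beta @ a)

def feasible_pos(beta, C, tol=1e-9):
    """Constraints of Eq. 7: entries in [0, C] that sum to 1."""
    return bool(abs(beta.sum() - 1.0) < tol and np.all(beta >= -tol) and np.all(beta <= C + tol))

def feasible_neg(beta, C, tol=1e-9):
    """Constraints of Eq. 8: entries in [-C, 0] that sum to -1."""
    return bool(abs(beta.sum() + 1.0) < tol and np.all(beta <= tol) and np.all(beta >= -C - tol))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                                    # toy training points
K = np.exp(-np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1))
a = np.diag(K).copy()                                          # a_i = k(x_i, x_i)
C = 1.0
beta_pos = rng.dirichlet(np.ones(5))                           # any feasible point of Eq. 7
same_value = bool(np.isclose(objective_pos(beta_pos, K, a),
                             objective_neg(-beta_pos, K, a)))
```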


5 Distributional Semantic (DS) Kernel
for text descriptions

When the $\mathcal{E}$ domain is the space of text descriptions, we propose
a distributional semantic kernel $g(\cdot, \cdot) = g_{DS}(\cdot, \cdot)$
to define the similarity between two text descriptions. We start with the
distributional semantic models of (Mikolov et al., 2013c; Mikolov et al.,
2013a) to represent the semantic manifold $\mathcal{M}_s$, and a function
$vec(\cdot)$ that maps a word to a $K \times 1$ vector in $\mathcal{M}_s$.
The main assumption behind this class of distributional semantic models is
that similar words share similar contexts. Mathematically speaking, these
models learn a vector for each word $w_n$, such that
$p(w_n | (w_{n-L}, w_{n-L+1}, \cdots, w_{n+L-1}, w_{n+L}))$ is maximized
over the training corpus, where $2 \times L$ is the context window size.
Hence the similarity between $vec(w_i)$ and $vec(w_j)$ is high if they
co-occurred often in contexts of size $2 \times L$ in the training text
corpus. We normalize all the word vectors to length 1 under the L2 norm,
i.e., $\| vec(\cdot) \|^2 = 1$.

Let us assume a text description $D$ that we represent by a set of triplets
$D = \{(w_l, f_l, vec(w_l)), l = 1 \cdots M\}$, where $w_l$ is a word that
occurs in $D$ with frequency $f_l$ and its corresponding word vector is
$vec(w_l)$ in $\mathcal{M}_s$. We drop the stop words from $D$. We define
$F = [f_1, \cdots, f_M]^\top$ and
$V = [vec(w_1), \cdots, vec(w_M)]^\top$, where $F$ is an $M \times 1$
vector of term frequencies and $V$ is an $M \times K$ matrix of the
corresponding term vectors.

Given two text descriptions $D_i$ and $D_j$, which contain $M_i$ and $M_j$
terms respectively, we compute $F_i$ ($M_i \times 1$) and $V_i$
($M_i \times K$) for $D_i$, and $F_j$ ($M_j \times 1$) and $V_j$
($M_j \times K$) for $D_j$. Finally, $g_{DS}(D_i, D_j)$ is defined as

\[
g_{DS}(D_i, D_j) = F_i^\top V_i V_j^\top F_j \tag{10}
\]

One advantage of this similarity measure is that it captures semantically
related terms. It is not hard to see that the standard Term Frequency (TF)
similarity can be thought of as a special case of this kernel where
$vec(w_l)^\top vec(w_m) = 1$ if $w_l = w_m$, and 0 otherwise, i.e.,
different terms are orthogonal. However, in our case the word vectors are
learnt through a distributional semantic model, which makes semantically
related terms have a higher dot product ($vec(w_l)^\top vec(w_m)$).

6 Experiments

6.1 Datasets and Evaluation Methodology 

We validated our approach in a fine-grained setting using two datasets:
1) The UCSD-Birds dataset (Welinder et al., 2010), which consists of 6033
images of 200 classes. 2) The Oxford-Flower dataset (Nilsback and
Zisserman, 2008), which consists of 8189 images of 102 flower categories.
Both datasets were amended with class-level text descriptions extracted
from different encyclopedias, which are the same descriptions used in
(Elhoseiny et al., 2013); see samples in the supplementary materials. We
split the datasets into 80% of the classes for training and 20% of the
classes for testing, with cross validation. We report multiple metrics
while evaluating and comparing our approach to the baselines, detailed as
follows.

Multiclass Accuracy of Unseen classes (MAU): Under this metric, we aim to
evaluate the performance of the unseen classifiers against each other.
First, the classifiers of all unseen categories are predicted. Then, an
instance $x^*$ is classified to the class $z^* \in \mathcal{Y}_{us}$ whose
predicted classifier has maximum confidence for $x^*$; see Eq. 2.

AUC: In order to measure the discriminative ability of our predicted
one-vs-all classifier for each unseen class against the seen classes, we
report the area under the ROC curve. Since unseen class positive examples
are few compared to negative examples, a large accuracy could be achieved
even if all unseen points are incorrectly classified. Hence, AUC is a more
consistent measure. In this metric, we use the predicted classifier of an
unseen class as a binary separator against the seen classes. This measure
is computed for each predicted unseen classifier and the average AUC is
reported. This is the only measure addressed in (Elhoseiny et al., 2013) to
evaluate the unseen classifiers, which is limiting in our opinion.
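The point about accuracy versus AUC under class imbalance can be illustrated with a small sketch. The scores below are made up for illustration, and the `auc` helper uses the standard pairwise-ranking definition of the area under the ROC curve rather than any code from the paper.

```python
import numpy as np

def auc(scores, is_positive):
    """Area under the ROC curve: the probability that a random positive outscores
    a random negative, with ties counted as one half."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    pos, neg = scores[is_positive], scores[~is_positive]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical scores f_z*(x) for 2 unseen-class images followed by 8 seen-class images.
labels = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=bool)
scores = np.array([-0.2, -0.6, -0.9, -0.8, -0.7, -0.5, -1.0, -1.2, -1.1, -0.95])

accuracy = np.mean((scores >= 0) == labels)   # thresholding at 0 labels every image negative
area = auc(scores, labels)
```

In this toy case every unseen-class image is misclassified at threshold 0, yet accuracy is 0.8 because negatives dominate; the AUC of 0.9375 instead reflects that the classifier ranks the unseen-class images above almost all seen-class images.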


$|N_{sc}|$ to $|N_{sc} + 1|$ Recall: Under this metric, we aim to check how
the learned classifiers of the seen classes confuse the predicted
classifiers, when they are involved in a multi-class classification problem
of $N_{sc} + 1$ classes. We use Eq. 2 to predict the label of an instance
$x^*$, such that the unknown label
$y^* \in \mathcal{Y}_{sc} \cup l_{us}$, where $l_{us}$ is the label of the
unseen class. We compute the recall under this setting. This metric is
computed for each predicted unseen classifier and the average is reported.
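A minimal sketch of this metric is given below, under the assumption that recall is measured on the images of the unseen class only. The function name and the toy score values are hypothetical.

```python
import numpy as np

def unseen_recall(seen_scores, unseen_scores):
    """Recall of one predicted unseen classifier among N_sc + 1 competing classifiers.

    seen_scores:   (n_images, N_sc) scores f_y(x*) of the learned seen classifiers
    unseen_scores: (n_images,) scores f_z*(x*) of the predicted unseen classifier
    All n_images are assumed to belong to the unseen class.
    """
    all_scores = np.column_stack([seen_scores, unseen_scores])
    predicted = np.argmax(all_scores, axis=1)      # Eq. 2 over N_sc + 1 classes
    unseen_index = all_scores.shape[1] - 1
    return float(np.mean(predicted == unseen_index))

# Four images of the unseen class scored by three seen classifiers and the predicted one.
seen_scores = np.array([[0.1, 0.2, 0.0],
                        [0.5, 0.1, 0.3],
                        [0.0, 0.0, 0.1],
                        [0.9, 0.2, 0.1]])
unseen_scores = np.array([0.4, 0.6, 0.2, 0.3])
recall = unseen_recall(seen_scores, unseen_scores)
```

Only the last image is captured by a seen classifier (0.9 versus 0.3), so the recall in this toy case is 0.75.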


6.2 Comparisons to Linear Classifier 
Prediction 


We compared our proposed approach to (Elhoseiny et al., 2013), which
predicts a linear classifier for zero-shot learning from textual
descriptions (the $\mathcal{E}$ space in our framework). The aspects of the
comparison include 1) whether the predicted kernelized classifier
outperforms the predicted linear classifier; 2) whether this behavior is
consistent on multiple datasets. We performed the comparison on both the
Birds and Flower datasets. For these experiments, in our setting, domain
$\mathcal{X}$ is the visual domain and domain $\mathcal{E}$ is the textual
domain, i.e., the goal is to predict classifiers from pure textual
description. We used the same features on the visual domain and the textual
domain as (Elhoseiny et al., 2013). That is, for the visual domain, we used
classeme features (Torresani et al., 2010), extracted from images of the
Birds and the Flower datasets. Classemes are 2569-dimensional features,
which correspond to confidences of a set of one-vs-all classifiers,
pre-trained on images from the web, as explained in (Torresani et al.,
2010), not related to either the Birds or the Flower datasets. The
rationale behind using these features in (Elhoseiny et al., 2013) was that
they offer a semantic representation. For the textual domain, we used the
same textual features extracted by (Elhoseiny et al., 2013). In that work,
tf-idf (Term-Frequency Inverse Document Frequency) (Salton and Buckley,
1988) features were extracted from the textual articles, followed by a CLSI
(Zeimpekis and Gallopoulos, 2005) dimensionality reduction phase.

Table 1: Recall, MAU, and average AUC on three seen/unseen splits on Flower
Dataset and a seen/unseen split on Birds dataset

                     Recall-Flower        improvement   Recall-Birds   improvement
SVM-DT kernel-rbf    40.34 (+/- 1.2) %                  44.05 %
Linear Classifier    31.33 (+/- 2.22) %   27.8 %        36.56 %        20.4 %

                     MAU-Flower           improvement   MAU-Birds      improvement
SVM-DT kernel-rbf    9.1 (+/- 2.77) %                   3.4 %
DT kernel-rbf        6.64 (+/- 4.1) %     37.93 %       2.95 %         15.25 %
Linear Classifier    5.93 (+/- 1.48) %    54.36 %       2.62 %         29.77 %
Domain Transfer      5.79 (+/- 2.59) %    58.46 %       2.47 %         37.65 %

                     AUC-Flower           improvement   AUC-Birds      improvement
SVM-DT kernel-rbf    0.653 (+/- 0.009)                  0.61
DT kernel-rbf        0.623 (+/- 0.01)     4.7 %         0.57           7.02 %
Linear Classifier    0.658 (+/- 0.034)    -0.7 %        0.62           -1.61 %
Domain Transfer      0.644 (+/- 0.008)    1.28 %        0.56           8.93 %


We denote our DT prediction and one-class-SVM adjusted DT prediction
approaches as DT-kernel and SVM-DT-kernel respectively. We compared against
the linear classifier prediction by (Elhoseiny et al., 2013). We also
compared against the direct domain transfer (Kulis et al., 2011), which was
applied as a baseline in (Elhoseiny et al., 2013) to predict linear
classifiers. In our kernel approaches, we used the Gaussian rbf-kernel as a
similarity measure in the $\mathcal{E}$ and $\mathcal{X}$ spaces (i.e.
$k(d, d') = \exp(-\lambda \| d - d' \|)$).

Recall metric: The recall of our approach is 44.05% for Birds and 40.34%
for Flower, while it is 36.56% for Birds and 31.33% for Flower using
(Elhoseiny et al., 2013). This indicates that the predicted classifier is
less confused by the classifiers of the seen classes compared with
(Elhoseiny et al., 2013); see Table 1 (top part).


MAU metric: It is worth mentioning that the multiclass accuracies for the
trained seen classifiers are 51.3% and 15.4% using the classeme features on
the Flower dataset and the Birds dataset¹ respectively. Table 1 (middle
part) shows the average MAU metric over three seen/unseen splits for the
Flower dataset and one split for the Birds dataset. Furthermore, the
relative improvements of our SVM-DT-kernel approach are reported against
the baselines. On the Flower dataset, it is interesting to see that our
approach achieved 9.1% MAU, 182% of the random guess performance, by
predicting the unseen classifiers using just textual features as privileged
information (i.e. the $\mathcal{E}$ domain). We also achieved 13.4%, 268%
of the random guess performance, in one of the splits (the 9.1% is the
average over 3 seen/unseen splits). Similarly, on the Birds dataset, we
achieved 3.4% MAU from text features, 132% of the random guess performance
(further improved to 224% in the next experiments).

¹ The Birds dataset is known to be a challenging dataset for fine-grained
classification, even when applied in a regular multiclass setting, as is
clear from the 15.4% performance on seen classes.

Figure 2: AUC of the 62 unseen classifiers on the Flower dataset over three
different splits (bottom part) and their top-10 ROC curves (top part)

AUC metric: Fig. 2 (top part) shows the ROC curves for our approach on the
best predicted unseen classes from the Flower dataset. Fig. 2 (bottom part)
shows the AUC for all the classes on the Flower dataset (over three
different splits). More results and figures are attached in the
supplementary materials. Table 1 (bottom part) shows the average AUC on the
two datasets, compared to the baselines.

Looking at Table 1, we can notice that the proposed approach performs
comparably to the baselines in terms of AUC. However, there is a clear
improvement in the MAU and Recall metrics. These results show the advantage
of predicting classifiers in kernel space. Furthermore, the table shows
that our SVM-DT-kernel approach outperforms our DT-kernel model. This
indicates the advantage of the class separation, which is adjusted by the
SVM-DT-kernel model. More details on the hyper-parameter selection are
attached in the supplementary materials.

6.3 Multiple Kernel Learning (MKL) 
Experiment 

This experiment shows the added value of proposing a kernelized zero-shot
learning approach. We conducted an experiment where the final kernel on the
visual domain is produced by Multiple Kernel Learning (Gönen and Alpaydin,
2011). For the visual domain, we extracted kernel descriptors for the Birds
dataset. Kernel descriptors provide a principled way to turn any pixel
attribute into patch-level features, and are able to generate rich features
from various recognition cues. We specifically used four types of kernels
introduced by (Bo et al., 2010) as follows: Gradient Match Kernels, which
capture image variation based on predefined kernels on image gradients;
and Color Match Kernels, which describe patch appearance using two kernels
on top of RGB and normalized RGB for regular images, and intensity for grey
images. These kernels capture image variation and visual appearance. For
modeling the local shape, Local Binary Pattern kernels have been applied.

Table 2: MAU on a seen-unseen split-Birds Dataset (MKL)

                            MAU      improvement
SVM-DT kernel-rbf (text)    4.10 %
Linear Classifier           2.74 %   49.6 %

We computed these kernel descriptors on local image patches of fixed size
16 x 16, sampled densely over a grid with step size 8, in a spatial pyramid
setting with four layers. The dense features are vectorized using codebooks
of size 1000. This process ended up with a 120,000-dimensional feature for
each image (30,000 for each type). Having extracted the four types of
descriptors, we compute an rbf kernel matrix for each type separately. We
learn the bandwidth parameters for each rbf kernel by cross validation on
the seen classes. Then, we generate a new kernel
$k_{mkl}(d, d') = \sum_i w_i k_i(d, d')$, such that $w_i$ is a weight
assigned to each kernel. We learn these weights by applying Bucak's
Multiple Kernel Learning algorithm (Bucak et al., 2010). Then, we applied
our approach, where the MKL-kernel is used in the visual domain and an rbf
kernel on the text TFIDF features.
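The combined kernel is a weighted sum of per-descriptor Gram matrices. The sketch below shows this combination step only; the weights and bandwidths are fixed by hand for illustration, whereas in the experiment they are learned (by an MKL algorithm and by cross validation respectively). The function names, feature sizes and the non-negativity assumption on the weights are choices made for this example.

```python
import numpy as np

def rbf_gram(X, bandwidth):
    """Gram matrix of k(d, d') = exp(-bandwidth * ||d - d'||) for the rows of X."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return np.exp(-bandwidth * dists)

def combine_kernels(grams, weights):
    """k_mkl = sum_i w_i k_i; non-negative weights keep the sum positive semi-definite."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0):
        raise ValueError("kernel weights are assumed non-negative in this sketch")
    return sum(w * G for w, G in zip(weights, grams))

rng = np.random.default_rng(0)
# Four hypothetical descriptor types for the same 5 images, with different dimensions.
features = [rng.normal(size=(5, d)) for d in (3, 4, 6, 8)]
bandwidths = [0.5, 1.0, 0.2, 0.1]
weights = [0.4, 0.3, 0.2, 0.1]
grams = [rbf_gram(F, b) for F, b in zip(features, bandwidths)]
K_mkl = combine_kernels(grams, weights)
```

Because the kernel-classifier formulation only consumes the Gram matrix, `K_mkl` can replace a single-kernel matrix without changing the rest of the pipeline, regardless of the dimensionality of the underlying descriptors.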


To compare our approach to (Elhoseiny et al., 2013) under this setting, we concatenated all kernel descriptors to end up with a 120,000-dimensional feature vector in the visual domain. As highlighted in the approach section, the approach in (Elhoseiny et al., 2013) solves a quadratic program of $N + d_X + 1$ variables for each unseen class. Due to the large dimensionality of the data ($d_X = 120{,}000$), this is not tractable. To make this setting applicable, we reduced the dimensionality of the feature vector to 4000 using PCA. This highlights the benefit of our approach, since it does not depend on the dimensionality of the data. The table for this setting shows MAU for our approach against (Elhoseiny et al., 2013). The results show the benefits of having a kernel approach for zero-shot learning, where kernel methods are applied to improve the performance.


Table 3: MAU on a seen-unseen split-Birds Dataset (CNN features, text description)

                                                                     MAU       improvement
SVM-DT kernel ($\mathcal{X}$-rbf, $\mathcal{E}$-DS kernel)           5.35 %    -
SVM-DT kernel ($\mathcal{X}$-rbf, $\mathcal{E}$-rbf on TFIDF)        4.20 %    27.3 %
Linear Classifier (TFIDF text)                                       2.65 %    102.0 %
(Norouzi et al., 2014)                                               2.3 %     132.6 %
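The PCA reduction applied for the baseline comparison can be sketched as follows. This is a minimal illustration on toy data, assuming a plain SVD-based PCA; the helper name `pca_reduce` and all sizes are hypothetical, and the paper's actual setting (120,000 dimensions reduced to 4000) is far larger. The last two lines also illustrate why a kernel formulation does not grow with feature dimensionality: a Gram matrix is always N × N.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project rows of X onto the top principal components (via SVD)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Economy SVD: rows of Vt are principal directions, sorted by variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    return Xc @ components.T, components, mean

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))   # toy stand-in for high-dimensional features
Z, components, mean = pca_reduce(X, n_components=10)

# A linear Gram matrix is N x N whatever the feature dimensionality is.
gram_full = X @ X.T
gram_reduced = Z @ Z.T
```

Here `Z` holds the reduced features, while `gram_full` and `gram_reduced` have the same shape even though the underlying feature dimensions differ.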


6.4 Multiple Representation Experiment and Distributional Semantic (DS) Kernel

The aim of this experiment is to show that our approach also works on different representations of the text and visual domains. In this experiment, we extracted Convolutional Neural Network (CNN) image features for the visual domain. We used the Caffe (Jia et al., 2014) implementation of (Krizhevsky et al., 2012). Then, we extracted the sixth activation feature of the CNN, since we found it works the best in the standard classification setting. We found this consistent with the results of (Donahue et al., 2014) over different CNN layers. Using TFIDF features of the text descriptions and CNN features for images, we achieved 2.65% for the linear version and 4.2% for the rbf kernel on both text and images. We further improved the performance to 5.35% by using our proposed Distributional Semantic (DS) kernel in the text domain and an rbf kernel for images. In this DS experiment, we used the distributional semantic model by (Mikolov et al., 2013c) trained on the GoogleNews corpus (100 billion words), resulting in a vocabulary of 3 million words and word vectors of K = 300 dimensions. This experiment shows both the value of having a kernel version and the value of the proposed kernel in our setting. We also applied the zero-shot learning approach in (Norouzi et al., 2014), which performs worse in our settings; see Table 3.


6.5 Attributes Experiment 

We emphasize that our main goal is not attribute prediction. However, it was interesting for us to see the behavior of our method where side information comes from attributes instead of text. In contrast to attribute-based models, which fully utilize attribute information to build attribute classifiers, our approach does not learn attribute classifiers. In this experiment, our method uses only the first moment of information of the attributes (i.e., the average attribute vector). We decided to compare to an attribute-based approach from this perspective. In particular, we applied the (DAP) attribute-based model (Lampert et al., 2014; Lampert et al., 2009), widely adopted in many applications (e.g., (Liu et al., 2013; Rohrbach et al., 2011)), to the Birds dataset. Details of the weak attribute representation in the $\mathcal{E}$ space are attached in the supplementary materials due to space. For the visual domain $\mathcal{X}$, we used classeme features in this experiment (as in the Table 1 experiment).


Table 4: MAU on a seen-unseen split-Birds Dataset (Attributes)

                         MAU       improvement
SVM-DT kernel-rbf        5.6 %     -
DT kernel-rbf            4.03 %    32.7 %
Lampert DAP              4.8 %     16.6 %
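The "first moment" representation used in this experiment, i.e. one average attribute vector per class, can be sketched as below. This is only an illustration under the assumption that the average is taken over per-image attribute annotations; the function name and the toy data are hypothetical and not taken from the paper.

```python
import numpy as np

def class_attribute_means(image_attributes, image_labels, n_classes):
    """Average the per-image attribute vectors of each class."""
    n_attributes = image_attributes.shape[1]
    means = np.zeros((n_classes, n_attributes))
    for c in range(n_classes):
        means[c] = image_attributes[image_labels == c].mean(axis=0)
    return means

# Toy data: 6 images, 3 binary attributes, 2 classes (all values made up).
image_attributes = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
], dtype=float)
image_labels = np.array([0, 0, 0, 1, 1, 1])

class_vectors = class_attribute_means(image_attributes, image_labels, n_classes=2)
```

Each row of `class_vectors` then plays the role of the class-level side information, with no attribute classifier being trained.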

An interesting result is that our approach achieved 5.6% MAU (224% of the random guess performance); see Table 4. In contrast, we get 4.8% multiclass accuracy using the DAP approach (Lampert et al., 2014). In this setting, we also measured the $N_{sc}$ to $N_{sc} + 1$ average recall. We found the recall measure is 76.7% for our SVM-DT-kernel, while it is 68.1% for the DAP approach, which reflects a better true positive rate (the positive class is the unseen one). We find these results interesting, since we achieved them without learning any attribute classifiers, as in (Lampert et al., 2014). When comparing the results of our approach using attributes (Table 4) vs. textual description (Table 1) as the privileged information used for prediction, it is clear that the attribute features give better prediction. This supports our hypothesis that the more meaningful the $\mathcal{E}$ domain, the better the performance on the $\mathcal{X}$ domain.


7 Conclusion 


We proposed an approach to predict kernel-classifiers of unseen categories from textual descriptions of them. We formulated the problem as a domain transfer function from the privileged space $\mathcal{E}$ to the visual classification space $\mathcal{X}$, while supporting kernels in both domains. We proposed a one-class SVM adjustment to our domain transfer function to improve the prediction. We validated the performance of our model by several experiments. We applied our approach using different privileged spaces (i.e., $\mathcal{E}$ lives in a textual space or an attribute space). We showed the value of proposing a kernelized version by applying kernels generated by Multiple Kernel Learning (MKL) and achieved better results. We also compared our approach with state-of-the-art approaches, and interesting findings have been reported.

^We are referring to the experiment that uses classeme as visual features, to have a consistent comparison to the experiment here.
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